The ENTS features extracted earlier in this notebook are having some problems, so it may be worthwhile to integrate the predictions ENTS makes about the human genome into the classifier. This is done using a file containing the predictions of the ENTS classifier over the human genome:



In [1]:

    
cd ../../ents/









    



/data/opencast/MRes/ents



In [8]:

    
ls









    



9606_0.50_predictions@  human.ENTS.features.conservative.pickle@  human.ENTS.features.pickle@  standalone/



In [10]:

    
!head 9606_0.50_predictions









    



ENSP00000268854	ENSP00000371119	0.52
ENSP00000364687	ENSP00000371119	0.56
ENSP00000350698	ENSP00000371119	0.57
ENSP00000335615	ENSP00000371119	0.58
ENSP00000238146	ENSP00000371119	0.53
ENSP00000343745	ENSP00000371119	0.61
ENSP00000258772	ENSP00000371119	0.50
ENSP00000298852	ENSP00000400157	0.63
ENSP00000310572	ENSP00000400157	0.64
ENSP00000274008	ENSP00000400157	0.56

These are simply pairs of Ensembl protein IDs, each with a confidence value that the interaction exists. Using a method similar to that used to extract the STRING summary features, we can make an object to return these values in our feature vector assembler. The first step is to load the dictionary mapping between Entrez and Ensembl IDs:



In [11]:

    
cd ../geneconversion/









    



/data/opencast/MRes/geneconversion



In [12]:

    
import pickle



In [13]:

    
# open in binary mode, which pickle expects
f = open("human.gene2ensemble.pickle", "rb")
gene2ensembl = pickle.load(f)
f.close()

As before, invert this dictionary:



In [14]:

    
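ensembl2gene = {}
for k in gene2ensembl:
    for p in gene2ensembl[k]:
        # create the list the first time an Ensembl ID is seen, then append;
        # handling each ID separately avoids overwriting earlier entries
        # when only some of the IDs in a list are new
        ensembl2gene.setdefault(p, []).append(k)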
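
The inversion can be checked on a small example. This is a minimal sketch, assuming `gene2ensembl` maps each Entrez ID to a list of Ensembl protein IDs; the toy IDs and the helper name `invert_mapping` are made up for illustration and do not come from the real file:

```python
# Toy stand-in for gene2ensembl; these IDs are invented for illustration.
toy_gene2ensembl = {
    "1": ["ENSP_A", "ENSP_B"],
    "2": ["ENSP_B"],
}


def invert_mapping(forward):
    """Invert a dict of key -> list of values into value -> list of keys."""
    inverted = {}
    for key in forward:
        for value in forward[key]:
            # setdefault creates the list on first sight, then we append
            inverted.setdefault(value, []).append(key)
    return inverted


toy_ensembl2gene = invert_mapping(toy_gene2ensembl)
```

An Ensembl ID shared by several genes (here `ENSP_B`) ends up mapped to all of them, which is why the values of the inverted dictionary are lists rather than single IDs.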



In [15]:

    
cd ../ents/









    



/data/opencast/MRes/ents

As before, build a dictionary mapping pairs of Entrez Gene IDs, stored as frozensets, to these prediction values:



In [16]:

    
import csv
import itertools



In [19]:

    
import pdb



In [22]:

    
f = open("9606_0.50_predictions")
c = csv.reader(f, delimiter="\t")
# no header this time
entsdict = {}
# iterate over rows building dictionary:
for l in c:
    #first build the (possibly various) keys
    try:
        geneids1 = ensembl2gene[l[0]]
        geneids2 = ensembl2gene[l[1]]
    except KeyError:
        #pdb.set_trace()
        #give up on pair if they can't be mapped to Entrez
        continue
    #then iterate over all combinations, saving the prediction value for each pair
    for i1,i2 in itertools.product(geneids1,geneids2):
        entsdict[frozenset([i1,i2])] = l[2]
f.close()
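
To illustrate why frozensets are used as keys, here is a small sketch of the same loop on made-up rows. The toy mapping and rows are assumptions for illustration only; the point is that a lookup does not depend on the order of the two IDs, and that rows with an unmapped Ensembl ID are skipped:

```python
import itertools

# Hypothetical toy inputs, not taken from the real files.
toy_ensembl2gene = {"ENSP_A": ["1"], "ENSP_B": ["2", "3"]}
toy_rows = [
    ["ENSP_A", "ENSP_B", "0.52"],
    ["ENSP_X", "ENSP_B", "0.70"],  # ENSP_X cannot be mapped, so it is skipped
]

toy_entsdict = {}
for row in toy_rows:
    try:
        ids1 = toy_ensembl2gene[row[0]]
        ids2 = toy_ensembl2gene[row[1]]
    except KeyError:
        # give up on the pair if either ID cannot be mapped to Entrez
        continue
    # one Ensembl ID may map to several Entrez IDs, so store every combination
    for i1, i2 in itertools.product(ids1, ids2):
        toy_entsdict[frozenset([i1, i2])] = row[2]

# frozenset keys are unordered, so both orderings find the same entry
score_forward = toy_entsdict[frozenset(["1", "2"])]
score_reverse = toy_entsdict[frozenset(["2", "1"])]
```

Note that the stored values are still the strings read from the file, and that if two rows map to the same Entrez pair the later row overwrites the earlier one.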

Then we import the class previously used to save the STRING-based feature, instantiate it with this dictionary and pickle it:



In [24]:

    
import sys



In [25]:

    
sys.path.append("../opencast-bio/")



In [26]:

    
import ocbio.ppipred



In [27]:

    
entsfeatures = ocbio.ppipred.features(entsdict,1)



In [28]:

    
f = open("human.Entrez.ENTS.summary.pickle","wb")
pickle.dump(entsfeatures,f)
f.close()
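
The `ocbio.ppipred.features` class itself is not shown in this notebook. As a rough illustration of the idea only, a hypothetical stand-in (called `PairFeatures` here, and not the ocbio class) might wrap the dictionary and return a fixed-length vector for any pair. The interpretation of the second argument as the vector length, and the `"missing"` placeholder, are both assumptions:

```python
class PairFeatures(object):
    """Hypothetical stand-in, NOT the ocbio.ppipred.features class."""

    def __init__(self, featuredict, vectorlength, missing="missing"):
        self.featuredict = featuredict    # frozenset pair -> stored value
        self.vectorlength = vectorlength  # assumed: length of returned vector
        self.missing = missing            # assumed placeholder for unknown pairs

    def __getitem__(self, pair):
        key = frozenset(pair)
        if key in self.featuredict:
            # wrap the single stored value in a list of length one
            return [self.featuredict[key]]
        # unknown pair: return placeholders of the expected length
        return [self.missing] * self.vectorlength


toy_features = PairFeatures({frozenset(["1", "2"]): "0.52"}, 1)
known = toy_features[("2", "1")]
unknown = toy_features[("1", "9")]
```

With an object like this, the feature vector assembler can ask for any Entrez pair and always get back a vector of the same length, whether or not ENTS made a prediction for that pair.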